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Abstract 

Background: A variety of species and experimental designs have been used to study genetic influences on 
alcohol dependence, ethanol response, and related traits. Integration of these heterogeneous data can be used to 
produce a ranked target gene list for additional investigation. 

Results: In this study, we performed a unique multi-species evidence-based data integration using three 
microarray experiments in mice or humans that generated an initial alcohol dependence (AD) related genes list, 
human linkage and association results, and gene sets implicated in C. elegans and Drosophila. We then used 
permutation and false discovery rate (FDR) analyses on the genome-wide association studies (GWAS) dataset from 
the Collaborative Study on the Genetics of Alcoholism (COGA) to evaluate the ranking results and weighting 
matrices. We found one weighting score matrix could increase FDR based q-values for a list of 47 genes with a 
score greater than 2. Our follow up functional enrichment tests revealed these genes were primarily involved in 
brain responses to ethanol and neural adaptations occurring with alcoholism. 

Conclusions: These results, along with our experimental validation of specific genes in mice, C elegans and 
Drosophila, suggest that a cross-species evidence-based approach is useful to identify candidate genes contributing 
to alcoholism. 



Background 

Research on the genetics and neurobiology of alcoholism 
uses a variety of study designs and model organisms. A 
wealth of data are available, including linkage studies in 
human alcoholics, microarray studies of inbred mouse 
strains' brains and rat brains exposed to ethanol, and stu- 
dies of loss or gain of function of genes in organisms 
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such as C. elegans and Drosophila [1,2]. Although results 
or information across experiments are often compared 
by individual researchers in order to generate hypotheses, 
interpret results, or prioritize targets for follow up inves- 
tigations [3], these analyses are not always done compre- 
hensively and rarely include a cross-species approach 
[4-7]. While data integration itself can be challenging, 
how best to utilize combined results is also unclear. 
Although pooled results may yield valuable insights, 
there are potential benefits of using more systematic 
approaches to generate quantitative rankings that can 
then, in turn, guide additional studies. In particular, these 
rankings could be applied to choosing molecular targets 
for knockdown studies in model organisms or genetic 
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association studies in humans. For this and other 
approaches, evaluation is often needed in order to deter- 
mine whether the rankings are effective at the end of the 
data integration process. 

Challenges exist for each stage of an integration pro- 
cess, including the creation of an empirical gene list 
across species and platforms, scoring the information, 
and then evaluating the scoring system itself. For exam- 
ple, once various data are collected, identifying the best 
way to integrate them poses a problem since the criteria 
for selecting gene lists often differ substantially across 
studies. Specifically in microarray studies, the expression 
of gene-specific transcripts is selected via statistical 
threshold(s), but individual genes can have multiple 
transcripts that may differ in their abundance [8]. 
Therefore, a given gene can yield multiple expression 
values through microarray or next generation RNA 
sequencing (RNA-Seq) analyses. Likewise, human 
genetic association studies test multiple genetic markers, 
usually single nucleotide polymorphisms (SNPs), across 
a gene. In contrast, the results of genetic linkage or 
quantitative trait locus (QTL) studies in humans or 
mice can span tens of megabases and contain potentially 
hundreds of genes. Furthermore, low replication rates 
and identification of non-functional markers in most 
studies makes the search for true genetic signals difficult 
[9-11]. While there are issues with data reduction or 
summarization, integration at the level of the gene can 
be used as a link across a number of commonly used 
approaches. 

If genetic information is summarized at the gene level, 
then each gene in the genome can be assigned a score for 
each experiment or data set available. This measurement 
can be quantitative or qualitative. For example, p-values 
may be assigned to a gene within a quantitative trait 
locus (QTL) or a linkage region. However, differences in 
gene-specific p-values within an interval of interest may 
be misleading since linkage peaks can shift, and variants 
responsible for the linkage may not be at the peak itself. 
In contrast, large numbers of genes may show altered 
expression in microarray studies and represent real 
changes due to signal cascades affecting entire gene net- 
works. These correlated expression networks, in which a 
large number of changes are expected, contrast with link- 
age regions, in which most if not all genes do not actually 
contain variant(s) linked to the disease. A combined p- 
value method can be used for quantitative analyses, but 
this approach may present its own challenges. The indivi- 
dual data sources may not be weighted equally since the 
relative magnitudes of the p-values can be vastly different 
across platforms (e.g., mouse and human QTL studies). 
To avoid such issues, qualitative scores that measure the 
presence or absence of evidence above a threshold may 



be used, but thresholds have their own problems. Regard- 
less of scoring choice, and despite some problems asso- 
ciated with each, a combined gene rank score can be 
generated from data integration. These gene rank scores 
can be used to perform weighted analysis or to define 
gene subsets for further investigation. 

The effectiveness of such ranking can be verified by 
conducting further testing on genes ranked highly in the 
analyses. Alternatively, because the design of genome- 
wide association studies (GWAS) is hypothesis free, this 
approach offers opportunities to empirically test a rank- 
ing method and provide insight into further refinement, 
and all or most potential candidate genes can be tested 
in one experiment. If higher ranked genes contain more 
significant SNPs than a random set of genes, then the 
utility of a cross-species and platform integration and 
ranking approach would be demonstrated [12]. In this 
report, we attempt to implement and evaluate the utility 
of the approach outlined above by collecting data across 
species and approaches, summarizing at the gene level, 
ranking the genes, and testing the rankings in complex 
traits related to alcoholism and ethanol response. We 
included data generated from ethanol response experi- 
ments because this trait is one of the contributing fac- 
tors for alcoholism [13]. 

Results 

Ranked gene list 

An initial list of 2458 genes that show altered expression 
in mouse brain in response to ethanol in two previous 
studies [3,14] was used as a starting point. These datasets 
were abbreviated as MuAc and MuPref (see Figure 1). 
Five additional data sources were used to construct a 
score for these genes, including 1) genes showing altered 
expression in the prefrontal cortex of human alcoholics 
(abbreviated as HuAlc) [15], 2) linkage intervals from 
published studies of the Collaborative Study on the 
Genetics of Alcoholism (COGA) and the Irish Study of 
Alcoholism samples (abbreviated as HuLink) [16-18], 3) 
genes contained on a human addiction/alcoholism array 
(abbreviated as HuAddChip) [19], 4) those from a smaller 
list of ethanol-related genes compiled from Drosophila 
(abbreviated as Dr) [20,21], and 5) a short list of ethanol- 
related genes compiled from C. elegans (abbreviated as 
Ce) [22]. Additionally, genes having cross-species hits 
acquired bonus scores (the "Cross" score in our algo- 
rithm, see Table 1), as cross-species evidence was 
regarded as an important factor in gene salience. Here, 
we used score to estimate the evidence of a gene, rather 
than using a quantitative measurement (e.g., significance 
level, see section Materials and methods). We proposed 
10 weighting score matrices (Table 1). The correspond- 
ing ranking results are shown in Table 2. 
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MuAc- murine, acute ethanol response genes (Kerns et al., 2005) 
MuPref- murine, ethanol preference drinking genes (Mulligan et al., 2006) 

HuAlc- human, alcoholic prefrontal cortex genes (Mayfield et al., 2002) 
HuLink- human, linkage region genes (Chr. 1, 4, 7; multiple sources) 
HuAddChip- human, alcoholism association genes (Hodgkinson et al., 2008) 
Dr - Drosophila, ethanol response genes (Morozova etal., 2006 & 2007) 
Ce-C. elegans, ethanol response genes (Kwon etal., 2004) 
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Figure 1 Data sources and ranking score results using weighting score matrix 3. The details of weighting score matrix 3 are provided in 
Table 1. 



Alcohol dependence GWAS analysis for ranked genes 

To assess the performance of our ranking algorithm and 
weighting score matrices, we explored whether the 
ranked genes showed non-random enrichment of 



significant signals in alcoholism GWAS results. Specifi- 
cally, we examined enriched association signals of 
ranked genes in the COG A GWAS [23], one of the lar- 
gest alcohol dependence GWAS datasets. To increase 



Table 1 Ten weighting score matrices used in cross-species data integration and gene ranking. 



Weighting score matrix (WSM) 



Description 



WSM1 


(0.5,0.5,1,1,0.5,0.5,0.5,1) * 


HuAlc, HuAddChip, and Cross ** given 1, all others 0.5 


WSM2 


(0.5,0.5,1,1,0.5,0.5,0.5,0.5) 


HuAlc and HuAddChip given 1, all others 0.5 


WSM3 


(0.5,0.5,0.5,1,0.5,0.5,0.5,0.5) 


HuAddChip given 1, all others 0.5 


WSM4 


(0.5,0.5,1,1,1,0.5,0.5,0.5) 


All human data given 1, all others 0.5 


WSM5 


(1,1,1,1,0.5,0.5,0.5,0.5) 


All mouse data, human HuAlc and HuAddChip given 1, all others 0.5 


WSM6 


(1,1,0.5,0.5,0.5,0.5,0.5,0.5) 


All mouse data given 1, all others 0.5 


WSM7 


(1,1,1,1,1,0.5,0.5,0.5) 


All mouse and human data given 1, all others 0.5 


WSM8 


(0.5,0.5,0.5,1,0.5,0.5,0.5,1) 


HuAddChip and Cross given 1, all others 0.5 


WSM9 


(1,1,1,1,1,1,1,1) 


All given 1 


WSM 10: (1,1,1,1,1,1,1,0.5) 


All but Cross given 1 



* The weights in the parentheses are in the order for the datasets: MuAc, MuPref, HuAlc, HuAddChip, HuLink, Ce, Dr, Cross, respective. The description of these 
datasets is provided in Figure 1. 

** Cross denotes the bonus score for genes who had cross-species hits except for genes from human linkage regions (see section Materials and methods). 
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Table 2 Summary of the genes by their scores using 10 
weighting matrices*. 



Weighting 
matrix 








Score 












0.5 


1 


1.5 


2 


2.5 
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3.5 
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1 


1722 


535 


40 


38 


102 


21 
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1722 


535 


78 


102 


21 
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131 


41 


6 








<1 
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10 
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137 


112 
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1745 


202 


222 


34 


11 






7 
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89 
41 


29 
6 

149 


17 


29 


10 




1966 




314 


132 


17 


29 





* For each matrix, scores of each gene were summarized based on its 
evidence in each dataset (see details in Table 1}. Then, the number of genes 
in each score range was counted and summarized in this table. 



the effect of our analyses we filtered the data for minor 
allele frequency, Hardy- Weinberg equilibrium deviation, 
and failure rate (see section Materials and methods). 
This resulted in 958,380 SNPs in our follow up analysis, 
with an observed minimum p-value of 9.5 x 10" 7 . How- 
ever, the minimum q-value was 0.605 after False Discov- 
ery Rate (FDR) analysis. Of those SNPs approximately 
68.7% (658,008/958,380) mapped to the human non- 
pseudogenes in NCBI Entrez Gene database, and they 
were used in this study. FDR analysis was then per- 
formed on restricted subsets of markers based on gene 
rank score. 

For each ranked gene under different weighting score 
matrices, we calculated the q-value of each SNP in the 
gene from the COGA GWAS data. The results for the 
ten weighting score matrices were summarized in Addi- 
tional file 1. For each weighting score matrix, its gene 
ranking performance was expected to increase by 
improving the q-values of SNPs mapped in the ranked 
genes. To quantitatively measure performance and cor- 
rect for gene size, we conducted 100 simulations, in 
each of which the same number of genes were randomly 
chosen from the whole gene set. FDR based q-value 
analysis was then performed on the GWAS genotyped 
SNPs that mapped to the randomly chosen genes. The 
proportion of q-values in each q-value bin (e.g., 0.1-0.2) 
was calculated and then compared with those from the 
actual ranked alcohol genes. For the simplicity of com- 
parison, we separated q-values into different bins. The 
results are shown in Additional file 1. 

According to our permutation results, the weighting 
score matrix 3 had the best performance, since it gave 
the lowest q-values among genes (Additional file 1). 
This matrix was then used to refine gene scores using 
1000 permutations (Table 3). In general, the subset of 



Table 3 Empirical p-values estimated from 1000 
permutations based on weighting score matrix 3. 

q- Gene score 

value 







>0.5 


>1 


>1.5 


>2 


= 2.5 


q 


< 0.9 


0.393 


0.367 


0.603 


0.191 


0.052 






(6399) 


(1863) 


(399) 


(866) 


(178) 


q 


< 0.8 


0.525 (469) 


0.292 (415) 


0.441 


0.310 


0.131 










(163) 


(199) 


(108) 


q 


< 0.7 


0.495 (8) 


0.228 (164) 


0.627 (5) 


0.208 


0.118 (72) 












(117) 




q 


< 0.6 


0.286 (5) 


N/A 


N/A 


0.225 (42) 


0.093 (53) 


q 


< 0.5 


0.113 (5) 


N/A 


N/A 


0.122 (39) 


N/A 


q 


< 0.4 


0.012 (5) 


N/A 


N/A 


0.101 (27) 


N/A 



Based on Table 2, we used weight matrix 3 to rank genes (a total of 2458 
genes) and then separated them according to their q-values. The number of 
SNPs in each q-value and score category based on COGA dataset is shown in 
parentheses. N/A: not available due to absence of the real data at those 
categories. 



SNP results restricted to the scored genes was enriched 
for significant effects as the gene rank score increased 
from 0.5 to 2.0 (see Table 4). Specifically, the minimum 
FDR based q-value was 0.605 for all SNPs passing QC. 
The minimum q-value decreased for SNPs in all scored 
genes, but then increased for genes with score > 1 or > 
1.5. However, the minimum q-value became the smallest 
(0.357) when this analysis was applied to genes with 
score > 2. There were 47 genes whose scores were > 2, 
and a total of 2293 SNPs mapped to these genes. For 
this gene subset, we found many more SNPs having 
small q values, including 27 SNPs with q-value < 0.4 
and 39 SNPs with q-value < 0.5, than those in other 
gene sets (e.g., gene subset with score > 1.5 or any 
scored, Table 4). Although this q-value analysis was not 
perfect (e.g., we did not find a steady decrease of q- 
value by increasing gene score, Table 4), it suggests that 
multi-species gene ranking by optimal weighting matrix 
might be effective for prioritizing candidate genes for 
complex traits. 

Bioinformatics analysis of top ranked genes 

As presented above and detailed in Table 5, 47 genes 
with score > 2 had promising q-value improvement, and 
were used as our high priority list for follow up bioin- 
formatics analysis. These genes also had evidence from 
at least 2 different species (Figure 1). We first performed 
functional enrichment analysis of Gene Ontology (GO) 
terms implemented in the WebGestalt tool. In this tool, 
each gene set is tested its functional enrichment with 
GO annotations based on the hypergeometric test. As 
shown in Table 6 the most significantly enriched func- 
tional terms belonged to the groupings of neurotrans- 
mitter receptor activity, ion binding, and synaptic 
structure. The most significantly enriched functional 
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Table 4 Improvement in GWAS-based association FDR as multi-species gene ranking 


score increases. 




FDR 


Total SNPs passing QC 


Scored 


Score >1 


Score >1.5 


Score >2 


Score — 2.5 


q-value 














All SNPs 


547920 


91774 


18988 


7948 


2293 


210 


< 0.9 


48016 


6399 


1863 


399 


866 


178 


< 0.8 


643 


469 


-'115 


163 


199 


108 


< 0.7 


6 


8 


164 


5 


117 


72 


< 0.6 


0 


5 


0 


0 


42 


53 


< 0.5 


0 


5 


0 


0 


39 


0 


< 0.4 


0 


5 


0 


0 


27 


0 


< 0.3 


0 


0 


0 


0 


0 


0 


< 0.2 


0 


0 


0 


0 


0 


0 


< 0.1 


0 


0 


0 


0 


0 


0 


< 0.05 


0 


0 


0 


0 


0 


0 


Min q 


0.605 


0.369 


0.667 


0.636 


0.357 


0.526 



terms were behavior (pbh = 5.08 x 10" 5 ), gamma-amino- 
butyric acid (GABA) signaling pathway (pbh = 8.51 x 
1CT 5 ) and cell communication (p BH = 0.0001) in the GO 
category of "Biological Process"; GABA receptor activity 
(Pbh = 3.35 x 10" 9 ), GABA-A receptor activity (p BH = 
7.38 x 10~ 8 ), and neurotransmitter binding (pbh = 7.38 
x 10" ) in the category of "Molecular Function"; and 
postsynaptic membrane (pbh = 1-86 x 10" ), chloride 
channel complex (p BH = 2.76 x 10" ), and synapse part 
(Pbh = 3.36 x 10" 7 ) in the category of "Cellular Compo- 
nent". Many of these enriched functional categories are 
consistent with the current knowledge of alcohol depen- 
dence and ethanol response [24]. These indicate that the 
top ranked genes are highly enriched in functions rele- 
vant to alcoholism. 

To further investigate whether our approach to select- 
ing the 47 genes is efficient, we compared the results 
with a similar analysis of top-ranked genes based on p 
values in COGA GWAS. We assigned the smallest p 
value of the marker mapped to a gene to represent 
gene-wise association significance. Then, we selected the 
most significant 47 genes. No functional term was sig- 
nificant in GO term analysis. Of note, our results were 
not corrected for gene length bias, a potential problem 
in gene-based association studies [25]. This comparison 
suggested that our cross-species gene ranking method 
may be more useful in extracting biological meaning 
from gene lists. 

We further examined the function of the 47 genes 
selected by cross-species ranking by using the ToppFun 
online tool [26]. ToppFun provides enrichment analysis 
of candidate genes in many biological categories, includ- 
ing GO terms, biological pathways, human and mouse 
phenotypes, protein domains, and reference search in 
PubMed. We presented the results of ToppFun as com- 
plementary information for WebGestalt analysis and 
summarized the results of enriched pathways in Table 7 



enriched mouse phenotypes in Table 8 and enriched 
PubMed citations in Table 9. In the pathway analysis, 
ToppFun uses a comprehensive collection of pathways 
from major databases such as KEGG, Reactome, and 
BioCarta [26]. The most enriched pathway is neuroac- 
tive ligand-receptor interaction (p = 5.36 x 10" ). Other 
significant pathways included GPCR ligand binding and 
G alpha signaling events; here, GPCR denotes G pro- 
tein-coupled receptor (Table 7). Moreover, mouse phe- 
notype analysis revealed that our selected genes are 
involved in neuron-related activity (Table 8). Overall, 
pathway and mouse phenotype enrichment analyses 
confirmed the results obtained from the GO term 
enrichment analysis by WebGestalt, and these analyses 
also revealed highlighted genes related to synaptic activ- 
ity and GABA signaling as being particularly represented 
in significant pathways and mouse phenotypes. Finally, 
we queried PubMed references by ToppFun to search 
for publications that are overrepresented with genes 
from our top ranked list (Table 9). The highest scored 
record from this analysis was from a genetic study of 
gene expression associated with alcohol consumption in 
rats and humans [4], in which 24 of our top genes were 
represented in the total of 130 genes described by this 
study and showed significant enrichment (p < 1 x 10" 6 ). 
The second highest scored record was from an associa- 
tion study of 182 candidate genes in anorexia nervosa 
(enrichment p < 1 x 10" ). 

Discussion 

In this work, we applied a unique cross-species, evi- 
dence-based gene prioritization strategy for genes 
involved in alcoholism. We started with a set of genes 
with prior microarray expression evidence of involve- 
ment in ethanol response, representing approximately 
10% of the human protein-coding genes. These genes 
were ranked using additional sources of evidence across 



Zhao ef at. BMC Genomics 2012, 13(Suppl 8):S16 
http://www.biomedcentral.eom/1 471 -2 1 64/1 3/S8/S1 6 



Page 6 of 1 2 



Table 5 47 genes with ranked score > 2 using weighting 
score matrix 3. 



Gene symbol 
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-VI 0 


2 
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2 



multiple species, including humans, mice, C. elegans and 
Drosophila. We used the COGA GWAS dataset and 
applied permutation analysis to evaluate the best weight- 
ing score matrix for gene ranking. Based on these 
results, we selected the top 47 genes with the best evi- 
dence for follow up bioinformatics analysis. Our func- 
tional enrichment test of these 47 genes suggested that 
this ranking algorithm identifies gene sets with coherent 
biological functions relevant to brain responses to etha- 
nol and neural adaptations occurring with alcoholism. 
Remarkably, higher ranking scores were predictive of 
genes containing an enrichment of significant SNP asso- 
ciations in the context of COGA alcohol dependence 
GWAS results. These results provide initial evidence 
that a cross-species analysis of gene networks correlated 
with molecular or behavioral responses to ethanol may 
provide a powerful strategy to identify candidate genes 
that contribute to alcoholism. 

The identification of genes mediating biological 
responses to ethanol, including the modification of risk 
profiles for alcoholism, is an area of intense research 
interest due to the possibility of pinpointing targets for 
future alcoholism therapies. Recent advances in beha- 
vioral genetics and genomics have identified large num- 
bers of genes that potentially contribute to phenotypic 
responses to ethanol in both human and animal models. 
However, little progress has been made in narrowing or 
organizing these large lists of genes into a tractable 
scheme for understanding the neurobiology and genetics 
of alcoholism. One approach that has been used for 
large collections of microarray data has been the perfor- 
mance of a meta-analysis across data on rodent models 
of divergent ethanol drinking collected from multiple 
centers and strains [3]. However, this analysis identified 
3,800 genes associated with variation in ethanol intake, 
making downstream hypothesis-driven studies difficult 
to formulate. 

As discussed in the Background, in our research 
approach, we pursued a gene ranking algorithm con- 
structed to integrate data on ethanol-related genes 
across species. We recognized that direct behavioral par- 
allels with ethanol response across humans, mice, Droso- 
phila and C. elegans were likely to be tenuous or non- 
existent. However, molecular commonalities underlying 
ethanol responses across species, if they could be identi- 
fied, should provide a powerful validation mechanism 
for candidate genes involved in ethanol behavioral 
responses, even if those particular behavioral compo- 
nents differ across species. 

Our ranking algorithm, while largely empirical at this 
stage, identified a ranked list of genes with obvious 
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Table 6 Functional enrichment test for the 47 top ranked genes using WebGestalt 



GO ID* 


Term 


# genes 


p-value 


Pbh 




GO:0007610 (BP) 


Behavior 


11 


1 1 7 v 

1.1/ A 




5.08 x 


10- 5 


GO:0007214 (BP) 


Gamma-aminobutyric acid signaling pathway 


•'! 


3 92 x 


0" 7 


8.51 X 


10' 5 


GO:0007154 (BP) 


Cell communication 


29 


7 66 x 


0" 7 


0.0001 




GO:0007631 (BP) 


Feeding behavior 


5 


1 1 ft V 

Z. I o X 


rr 6 

U 


0.0002 




GO:0050794 (BP) 


Regulation of cellular process 


37 


3 00 X 


0" 6 


0.0002 




GO:0050789 (BP) 


Regulation of biological process 


38 


1 73 v 

1 ./ D X 


rr 6 


0.0002 




GO:0033555 (BP) 


Multicellular organismal response to stress 


A 


J.Z I X 


rr 6 
u 


0.0002 




GO:0007186 (BP) 


G-protein coupled receptor protein signaling pathway 


14 


A QQ v 


u 


0.0002 




GO:0007166 (BP) 


Cell surface receptor linked signal transduction 


19 


Z.JJ X 


no 
u 


0.0002 




GO:0065007 (BP) 


Biological regulation 


38 


8 70 X 


0" 6 


0.0003 




GO:0016917 (MF) 


GABA receptor activity 


6 


3 11 V 
J.ZZ X 




3.35 X 


10' 9 


GO:0004890 (MF) 


GABA-A receptor activity 


5 


1 v 

i .y d x 


rr 9 


7.38 X 


lO" 8 


GO:0042165 (MF) 


Neurotransmitter binding 


7 


-i n v 
Z,|J X 


IT 9 
U 


7.38 X 


lO" 8 


GO:0030594 (MF) 


Neurotransmitter receptor activity 


6 


3. JO X 


rr 8 


9.26 X 


10' 7 


GO:0005254 (MF) 


Chloride channel activity 


6 


0.U4 X 


IT 8 
U 


1.26 X 


10" 6 


GO:0005253 (MF) 


Anion channel activity 


6 


o nft \s 
y.uo x 


rr 8 
u 


1.35 X 


lO" 5 


GO:0031404 (MF) 


Chloride ion binding 


6 


o nft -*s 
y.uo x 


rr 8 
u 


1.35 X 


10" 6 


GO:0043168 (MF) 


Anion binding 


6 


Z.04 X 


rr 7 
u 


3.43 X 


lO" 5 


GO:0005230 (MF) 


Extracellular ligand-gated ion channel activity 


5 


i ii ^ 

Z.ZZ X 


rr 6 
u 


2.57 X 


10" 5 


GO:0008509 (MF) 


Anion transmembrane transporter activity 


6 


J.yy X 


ri" 6 

U 


4.15 X 


10' 5 


GO:0045211 (CC) 


Postsynaptic membrane 


8 


2.22 X 


0" 9 


1.86 X 


10" 7 


GO:0034707 (CC) 


Chloride channel complex 


6 


6.57 x 


0" 9 


2.76 X 


10' 7 


GO:0044456 (CC) 


Synapse part 


9 


1.20 x 


0" s 


3.36 X 


10' 7 


GO:0045202 (CC) 


Synapse 


9 


2.51 x 


O" 7 


5.27 X 


10" 6 


GO:0030054 (CC) 


Cell junction 


9 


5.88 x 


O" 6 


9.88 X 


10" 5 


GO:0034702 (CC) 


Ion channel complex 


6 


1.53 X 


0" 5 


0.0002 




GO:0044459 (CC) 


Plasma membrane part 


16 


1.70 X 


0" 5 


0.0002 




GO:0031226 (CC) 


ntrinsic to plasma membrane 


11 


0.0002 




0.0019 




GO:0005887 (CC) 


ntegral to plasma membrane 


11 


0.0002 




0.0019 




GO:0033267 (CC) 


Axon part 


3 


0.0003 




0.0025 





* BP: biological process; MF 
** BH: p-value corrected by 



molecular function; and CC: cellular component, 
the Benjamini-Hochberg method (1995) [50]. 



coherence in terms of functional gene networks. A 
remarkably large number of genes already validated as 
altering behavioral responses to ethanol were contained 
in the higher ranks. In addition, bioinformatics analysis 



showed several interesting biological functions that were 
over-represented among the ranked genes (Tables 6, 7, 
8, 9), which is largely consistent with our previous ana- 
lysis based on a network approach [27]. Again, a 



Table 7 Pathways significantly associated with top candidate genes by ToppFun. 



Pathway ID/name 


Description 


Source 


p-value 


Terms in 
query 


Terms in 
genome 


hsa04080 


Neuroactive ligand-receptor 
interaction 


KEGG pathway 


5.36 X 10~ 5 


10 


256 


REACTOME_GPCR_LIGAND_BINDING 


Genes involved in GPCR ligand 
binding 


MSigDB: C2.cp - 
Reactome 


2.66 X 1 0~ 3 


10 


392 


REACTOME_PEPTIDE_LIGAND_BINDING_RECEPTORS 


Genes involved in Peptide ligand- 
binding receptors 


MSigDB: C2.cp - 
Reactome 


4.36 X 1 0~ 3 


7 


173 


REACTOME_DOWNSTREAM_EVENTS_IN_GPCR_SIGNALING 


Genes involved in Downstream 
events in GPCR signaling 


MSigDB: C2.cp - 
Reactome 


8.63 X 1 0~ 3 


10 


448 


REACTOME_CLASS_A1_RHODOPSIN_LIKE_RECEPTORS 


Genes involved in Class A/1 
(Rhodopsin-like receptors) 


MSigDB: C2.cp - 
Reactome 


1 .62 X 1 0~ 2 


8 


292 


REACTOME_G_ALPHA_Q_SIGNALLING_EVENTS 


Genes involved in G alpha (q) 
signaling events 


MSigDB: C2.cp - 
Reactome 


2.78 X 1 0~ 2 


6 


157 
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Table 8 Mouse phenotypes significantly associated with top candidate genes by ToppFun. 



Phonntunp in 

I 1 ICI l*J Ly JJC IU 


Phonntwno namo 
r 1 ICl HJLV (JC 1 lul 1 IC 


p-value 


Tormc in niiofv/ 
ICIIII3 III IfUCI y 


Tormc in nonnmo 
iciiiia ill ytrinjiiit: 


!Vlr.(J(JUy/4D 


Abnormal behavioral response to xenobiotic 


0./ / X 


u 


1 2 


21 5 


IVlr.UUUzzUo 


Abnormal CNS synaptic transmission 


431 X 


0" 6 


I -1 


3 82 


Ivlr.UUUjDjD 


Abnormal synaptic transmission 


J.D / A 




1 A 


■ 1 jU 


IVlr.UUUzUoz 


Abnormal conditioning behavior 


A Q7 v 
u.o/ X 


rr 5 
u 


I U 


1 on 
i yy 


IVlr.UUUzUOD 


Abnormal learning/memory /conditioning 


9 26 x 


O" 5 


1 3 

I .J 


4UD 


Ivlr .UUUZUOD 


Abnormal fear/anxiety-related behavior 


i n^ v 

1 .UD X 


U 


1 n 

I u 


ZUo 


IVlr.UUU 1 ooz 


Abnormal anxiety-related response 


3 CQ v 

x 


u 


y 


1 7Q 

i /y 


IVlr.UUUzb/z 


Abnormal emotion/affect behavior 


O.DH X 


u 


1 1 


329 


MD-nnn 1 a ca 

IVlr.UUU I 434 


Abnormal cued conditioning behavior 


1 41 x 


0" 3 


0 


DO 


Ivlr.UUUjODD 


Abnormal nervous system physiology 


1 AA v 






1 33 3 


ivir.uuu i jyy 


Hyperactivity 


i 1;7 s/ 

Z.J / X 


u 


y 


lift 
ZZO 


IVlr.UUU I JoJ 


Increased anxiety-related response 


Z.oD X 


(T 3 
U 


7 


1 1 / 


MP:000935/ 


Abnormal seizure response to inducing agent 


9.04 x 


0" 3 


7 


139 


MP:0001449 


Abnormal learning/memory 


1.22 x 


0" 2 


10 


349 


MP:0003088 


Abnormal prepulse inhibition 


1.45 X 


0" 2 


5 


57 


MP:0003313 


Abnormal locomotor activation 


1.46 x 


O" 2 


13 


630 


MP:0000950 


abnormal seizure response to pharmacological agent 


1.53 X 


O" 2 


6 


99 


MP:0002945 


Abnormal inhibitory postsynaptic currents 


1.58 X 


O" 2 


5 


58 


MP:0004008 


Abnormal GABA-mediated receptor currents 


2.85 X 


O" 2 


3 


I I 


MP:0009747 


Impaired behavioral response to xenobiotic 


3.45 X 


O" 2 


5 


68 


MP:0004747 


Abnormal cochlear OHC afferent innervation pattern 


348 X 


O" 2 


2 


2 



number of individual gene members from the con- 
structed networks have strong prior validation as candi- 
date genes that influence alcoholism traits in humans or 
behavioral responses to ethanol in animal models. These 
validated genes serve to increase the probability for the 
entire gene network playing a role in ethanol responses. 

Although gene targeting approaches in animal models 
might ultimately be the most robust method for validat- 
ing the role of individual genes in ethanol response beha- 
viors, such studies are complex and time-consuming. We 
chose, as an initial approach to validate our cross-species 
ranking algorithm, a study of the association of the gene 
ranking score with alcoholism traits in a GWAS analysis. 
We found a reduction in the minimum FDR q-value as 
the ranking score increased to 2. It is important to note 
that this effect is not due to the progressive limiting of 
markers examined. In this study, FDR is not dependent 
on the number of tests performed. 



Although the results are encouraging, the limitations of 
the current analysis and possible improvements must be 
noted. We noted that when the gene rank score cutoff 
increased from 2.0 to 2.5, the size of the q-values 
reversed. This observation might be attributed to overly 
restricted gene selection given that number of SNPs in 
genes dropped from 2293 in 47 genes to 210 in only 6 
genes. Another limitation is that the use of genes from 
the addiction/alcoholism array represents hypotheses 
about important genes, as selected by expert review, 
rather than selection from empirical association data. We 
could improve the current approach in the following 
ways. First, although we included seven datasets in the 
gene ranking, many additional datasets now exist or will 
be released in the near future that may be used in multi- 
species data integration. Additionally, this single GWAS 
dataset is likely to be underpowered given the recent evi- 
dence showing many loci of small effect influence most 



Table 9 PubMed citations significantly over-represented with top candidate genes by ToppFun. 

PubMed Description p-value Terms in Terms in 
ID query publication 

19874574 Genetical genomic determinants of alcohol consumption in rats and humans. < 1 x 10~ 6 24 130 

20468064 Association study of 182 candidate genes in anorexia nervosa. < 1 x 10~ 6 15 182 

18985723 GABA neurotransmitter signaling in the developing mouse lens: dynamic regulation of < 1 x 10~ 6 7 18 
components and functionality. 

21205893 TrkB receptor controls striatal formation by regulating the number of newborn striatal < 1 x 10~ 6 6 12 
neurons. 

16987237 Reduced expression of neuropeptide genes in a genome-wide screen of a < 1 x 10~ 6 8 67 
secretion-deficient mouse. 
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complex human traits. However, a network or pathway 
analysis approach to analyze a set of genes might improve 
power [12]. 

While there are undoubtedly numerous ways to score 
or weight genes, we have shown that this simple empiri- 
cal approach is effective. Our results demonstrate the 
utility of gene ranking after cross-species data integra- 
tion. Since this initial study demonstrated the utility of 
this approach, we are continuing to expand the number 
of data sets and improve the scoring scheme through a 
more sophisticated optimization of weighting para- 
meters. As more data is included, additional alcohol 
GWAS results become available, and more sophisticated 
gene ranking algorithms are developed, we expect 
improvement in specificity and sensitivity. For example, 
there are many gene expression studies in rat brain 
from animals evaluated for alcohol-preference behavior 
[2,28-31], and they will be integrated in future gene 
ranking. However, our initial gene targeting experiments 
in animal models, using the ranked gene lists derived in 
this study, have already identified several novel genes 
that alter ethanol response behaviors in mice, Droso- 
phila or C. elegans (unpublished data). This provides 
direct support of our cross-species gene ranking. 

Conclusion 

In this study, we proposed a cross-species, evidence- 
based gene ranking strategy and demonstrated it in the 
eight alcoholism or ethanol response related datasets 
from four species (human, mouse, fly, and worm). 
Through the use of permutation and FDR analyses, we 
evaluated 10 weighting score matrices and found that 
one of them had the best performance. Using this opti- 
mal weighting matrix, we selected 47 genes whose 
scores were greater than 2 for follow up bioinformatics 
analysis. Functional enrichment tests revealed that these 
47 genes are involved in brain responses to ethanol and 
neural adaptations occurring with alcoholism. Our 
results, with further experimental validation in three 
animal models, suggest that our approach is useful for 
cross-species gene prioritization. 

Materials and methods 

Cross species gene ranking 

In an effort to populate an inclusive gene list with non- 
biased data from at least two species, we used published 
microarray gene expression data from our own and 
other laboratories. As shown in Figure 1, microarray 
gene expression data was used from three sources: acute 
responses to ethanol in C57BL/6 and DBA2/J mice 
(whole genome analysis of samples from reference [14]) 
that had been supplemented with additional microarray 
studies (U74B and U74C arrays, Affymetrix) on the 
same samples, a meta-analysis of genes involved in 



ethanol preference drinking across multiple mouse 
strains [3], and analysis of prefrontal cortex from autop- 
sied samples of alcoholic and non-alcoholic brains [15]. 
We then merged these datasets by utilizing gene homol- 
ogy mapping features within the WebGestalt [32]. This 
produced a list of 2458 genes. These genes were ranked 
by scores resulting from the following algorithm: 

S = Wi(MuAc) + W2(MuPref) + W3(HuAlc) + W4(HuAddChip) 
+W5(HuLink) + Ws(Ce) + Wy(Dr) + Wg (Cross). 

The symbols refer to sources diagramed in Figure 1. 
MuAc, MuPref and HuAlc refer to presence in the 
microarray studies mentioned above. HuAddChip are 
selected genes from human association studies on alcohol 
dependence using the "addiction chip" designed by David 
Goldman and colleagues at the National Institute on 
Alcohol Abuse and Alcoholism (NIAAA) [19]. HuLink 
refers to genes contained within linkage regions that have 
been implicated multiple times across human studies of 
alcohol-related phenotypes on chromosomes 1, 4, and 7 
[16-18]. The region on chromosome 1 ranges from 
D1S1613 at 64,007,000 bp to D1S2624 at 154,898,000 bp 
(according to HapMap build 36) and encompasses a vari- 
ety of overlapping linkage signals to alcohol-related phe- 
notypes, including alcohol dependence, heavy drinking, 
sensitivity to alcohol, and tolerance, across a number of 
samples [17,18,33-38]. The chromosome 4 region ranges 
from D4S2382 at 39,727,200 bp to D4S1615 at 
128,429,200 bp and encompasses linkage peaks from four 
independent samples [16,18,39-42]. The chromosome 7 
region ranged from D7S691 at 41,996,200 bp to D7S1817 
at 109,026,000 bp and constitutes the strongest linkage 
region in the Collaborative Study of the Genetics of Alco- 
holism (COGA) sample [18,43-45]. The invertebrate gene 
sets are from published studies in C. elegans [22] and 
Drosophila [20,21]. Finally, the "Cross" term is a bonus 
score added for cross-species hits for a given gene except 
for genes from human linkage regions. The weighting 
terms w, (i = 1, 2, 3, ... 8) were empirically chosen with 
0.5 or 1.0 in 10 different weighting score matrices (Table 
1). After a permutation test with COGA GWAS data, we 
found the weighting score matrix 3 could provide the 
best performance. 

Analysis of COGA alcohol dependence GWAS dataset 

The COGA GWAS dataset was used to evaluate the 
gene rankings. It contains 1205 cases and 700 controls 
[23]. All cases met DSM-IV criteria for alcohol depen- 
dence. Controls were defined as individuals who have 
consumed alcohol, but did not meet any definition of 
alcohol dependence or alcohol abuse, nor did they meet 
any DSM-IIIR or DSM-IV definition of abuse or depen- 
dence for other drugs (except nicotine). The Illumina 
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human 1M chipset was used for genotyping. Only DNA 
samples achieving a call rate of > 95% were included. A 
total of 1,041,465 SNP markers were used for case-con- 
trol analyses. We conducted population stratification 
and association analyses using PLINK, a highly flexible, 
fast, and user-friendly package for GWAS analysis [46]. 
In our analyses we included only SNPs if their genome- 
wide failure rate did not exceed 0.05. SNPs were further 
excluded if minor allele frequency was less than 0.01. 
After these data filtering processes, 958,380 SNPs were 
used for further analyses. Then, we mapped these SNPs 
to non-pseudogenes in the NCBI Entrez Gene database. 
Specifically, a SNP belongs to a gene if it locates in the 
region within 10 kb upstream to 10 kb downstream of 
the gene. 

FDR control 

To control the risk of false discoveries in GWAS stu- 
dies, for each p-value, we calculated a q-value [47,48]. A 
q-value is an estimate of the proportion of false discov- 
eries among all significant markers (i.e., q-values are 
FDRs) when the corresponding p-value is used as the 
threshold to declare significance. As argued previously 
[49], we preferred this approach to more traditional 
multiple testing methods that control the probability of 
producing one or more false discoveries for a set of 
tests [50]. Our approach was preferred because these q- 
values 1) provide a better balance between the compet- 
ing goals of finding true positives versus controlling 
false discoveries, 2) allow the use of more similar stan- 
dards in terms of the proportion of false discoveries pro- 
duced across studies due to much less dependence on 
the number of tests performed, 3) are relatively robust 
against the effects of correlated tests [47,49,51-56], and 
4) rather than an all-or-nothing conclusion about 
whether a study produces significant results, instead 
provide a more subtle picture about the possible rele- 
vance of the tested markers. The FDR procedure is per- 
formed in the R statistical package. 

Random permutation for different score matrix 

To test the significance of the gene ranking enrichment 
result for each weighting score matrix, we did 100 ran- 
dom permutations for the q-value enrichment. Since 
longer genes tend to have more SNPs in GWAS data, to 
reduce this gene length bias, we restricted the gene 
length of the random selections in each permutation 
within ± 50 kb of the average length of our ranked 
genes. We set the permutation p-value as the proportion 
of permutation times in which there are higher q-value 
proportions in randomly selected genes than in our 
ranked genes in the corresponding score region. For 
example, there were n genes with a score s under a spe- 
cific weighting score matrix. Then, in each permutation, 



n genes were selected from all human genes whose 
length is ± 50 kb of the average length of the n ranked 
genes. The q-values for SNPs in the randomly selected 
genes were calculated based on the GWAS data. For 
simplicity of comparison, we compared the number of 
SNPs in each q-value range (e.g., < 0.9, < 0.8, etc.). The 
proportion of the q-value number for each q-value 
range in randomly selected genes was then calculated. If 
the proportion was larger than our ranked alcohol genes 
at the same q-value range, we counted this permutation 
as a "significant permutation" for the specific score 
range s and q-value range. After 100 permutations, the 
proportion of "significant permutation" was set as the p- 
value of our permutation result at the corresponding 
score and q-value range. For the weighting score matrix 
with the best performance, permutation testing was 
repeated 1000 times again to check the significance. 

Bioinformatics analysis of cross-species ranked gene list 

The 47 top ranked genes with a score > 2 were exam- 
ined for enriched GO terms using the WebGestalt 
online tool (version 2) [32]. This tool examines the 
over-representation of genes of interest in GO terms 
based on the hypergeometric test followed by the Benja- 
mini-Hochberg (1995) adjustment of p-values [50]. We 
then used the ToppFun online tool [57], which is an 
integrated over-representation analysis tool, to interro- 
gate databases for biological pathways, mouse pheno- 
types, and PubMed citations. 

Additional material 



Additional file 1: Summary of the number and proportion of q- 
values, and p-value of 100 permutation results for different ranked 
score and q-value range under each of the 10 weighting score 
matrices. 
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